COVID-19 Peru - ML Model & Analytics
Before we begin, please note that anything in bold is either a finding or an important comment.
Business Understanding
This initial analysis is meant to visualize and forecast COVID-19 spread in Peru.
Peru COVID-19 dataset updated on: 14 April 2020
Data Understanding
With the business problem understood, the initial datasets are loaded for data preparation and exploration prior to model development.
Libraries & Functions
Let’s begin by loading the libraries and functions that will be used for this analysis:
## plyr dplyr
## TRUE TRUE
## data.table readxl
## TRUE TRUE
## stringr stringi
## TRUE TRUE
## forecast tidyverse
## TRUE TRUE
## caret zoo
## TRUE TRUE
## plotly DT
## TRUE TRUE
## rpart tidyr
## TRUE TRUE
## inspectdf DataExplorer
## TRUE TRUE
## ggfortify magrittr
## TRUE TRUE
## ggpmisc prophet
## TRUE TRUE
## ggplot2 tsoutliers
## TRUE TRUE
## dlm reshape2
## TRUE TRUE
## AppliedPredictiveModeling PerformanceAnalytics
## TRUE TRUE
## ggdendro rgdal
## TRUE TRUE
## tsibble
## TRUE
Data Collection
Raw data for each Peru COVID-19 dataset are loaded.
# load raw data
load("rda/COVID_19_Peru_Raw.rda")
load("rda/COVID_19_Peru_Raw_Inc2day.rda")
Data Summary
The COVID-19 dataset (COVID_19_Peru_Raw) has 40 rows and 12 columns, while the COVID-19 dataset of case increases every 2 days has 20 rows and 3 columns.
Exploratory Data Analysis (EDA)
After the raw data are loaded, I proceed to perform EDA on the datasets.
Data Cleaning
In this step, I perform data cleaning where needed.
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 40 12 1 11 0
## # ... with 4 more variables: total_missing_values <int>,
## # complete_rows <int>, total_observations <int>, memory_usage <dbl>
## # A tibble: 1 x 9
## rows columns discrete_columns continuous_colu~ all_missing_col~
## <int> <int> <int> <int> <int>
## 1 20 3 1 2 0
## # ... with 4 more variables: total_missing_values <int>,
## # complete_rows <int>, total_observations <int>, memory_usage <dbl>
Daily cumulative incidence
First, let’s look at the daily cumulative number of incident, lab-confirmed cases for Lima, for all the other provinces combined, and for all of Peru.
The initial increases for Lima and the balance of the other provinces, and for all of Peru look to be approximately sigmoidal, as is expected for epidemic spread. Let’s plot them on a logarithmic y axis. We would expect to see a linear increase on a log scale if the epidemic curve is indeed exponential.
One could convince oneself that log-linearity is present.
Daily incremental incidence
Let’s also look at the daily incremental incidence. This is more informative, and is known in outbreak epidemiological parlance as the epidemic curve. It is traditionally visualised as a bar chart, which emphasises missing data more than a line chart, even one with points as well. We’ll adhere to epidemiological tradition.
At the time of writing (13th April), it looks like incidence has yet to plateau in Peru.
Daily cumulative and incremental deaths in lab-confirmed cases
Now let’s look at the daily cumulative number of deaths in lab-confirmed cases for all provinces in Peru combined.
Let’s also look at the daily incremental deaths in lab-confirmed cases.
Clearly daily counts of deaths are continuing to rise despite the fact that the daily count of new cases is now falling. This is not surprising, given that it takes some time for cases to either recover or to die, and therefore the trend in deaths will necessarily lag behind any trend in daily incidence.
Fitting an SIR model to the Peru data
On 4 February 2020, data science blogger Learning Machines posted this analysis of the COVID-19 outbreak, in which he fitted the classic SIR (Susceptible-Infectious-Recovered) model to the incidence data for all of China. I’ll use his explanation of how to fit this model using R as the basis for this analysis.
The basic idea behind the SIR model of communicable disease outbreaks is that there are three groups (also called compartments) of people: those who are healthy but susceptible to the disease \(S\), the infectious (and thus, infected) \(I\) and people who have recovered \(R\):
Source: wikipedia
To model the dynamics of the outbreak we need three differential equations, to describe the rates of change in each group, parameterised by \(\beta\) which controls the transition between \(S\) and \(I\) and \(\gamma\) which controls the transition between \(I\) and \(R\):
\[\frac{dS}{dt} = - \frac{\beta I S}{N}\]
\[\frac{dI}{dt} = \frac{\beta I S}{N}- \gamma I\]
\[\frac{dR}{dt} = \gamma I\]
The first step is to express these differential equations as an R function, which is easier than one might think – the expressions in the code are just direct translations of the differential equations, with respect to time \(t\) of course.
SIR <- function(time, state, parameters) {
  par <- as.list(c(state, parameters))
  with(par, {
    dS <- -beta * I * S / N
    dI <- beta * I * S / N - gamma * I
    dR <- gamma * I
    list(c(dS, dI, dR))
  })
}
To fit the model to the data we need two things: a solver for these differential equations and an optimiser to find the optimal values for our two unknown parameters, \(\beta\) and \(\gamma\). The function ode() (for ordinary differential equations) from the deSolve package for R makes solving the system of equations easy, and to find the optimal values for the parameters we wish to estimate, we can just use the optim function built into base R. Specifically, what we need to do is minimise the sum of the squared differences between \(I(t)\), which is the number of people in the infectious compartment \(I\) at time \(t\), and the corresponding number of cases as predicted by our model \(\hat{I}(t)\). This quantity is known as the residual sum of squares (RSS)1.
\[RSS(\beta, \gamma) = \sum_{t} \left( I(t)-\hat{I}(t) \right)^2\]
I now proceed to fit a model to the incidence data for all of Peru. We need a value \(N\) for the initial uninfected population. Once again we’ll scrape population data from a suitable wikipedia page.
The approximate population of Peru in 2017 was 31,237,385 people, according to this wikipedia page.
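The code that follows uses this figure as \(N\), the initial uninfected population. As a minimal standalone sketch (the full analysis scrapes the table from Wikipedia rather than hard-coding it):

```r
# population of Peru (2017 census figure quoted above), used as N,
# the initial size of the susceptible compartment in the SIR model
N <- 31237385
```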
Next, we need to create a vector with the daily cumulative incidence for Peru, from 7th March, when our daily incidence data start, through to the current date. We’ll then compare the predicted incidence from the SIR model fitted to these data with the actual incidence since 7th March. We also need to initialise the values for \(S\), \(I\) and \(R\).
# put the daily cumulative incidence numbers for Peru from 7th Mar to 14th
# Apr into a vector called Infected
sir_start_date <- "2020-03-07"
Infected <- Peru_Cumulative_Incidence %>% filter(province == "Peru_cases_increase",
Date >= ymd("2020-03-07"), Date <= ymd("2020-04-14")) %>% pull(cumulative_incident_cases)
# Create an incrementing Day vector the same length as our cases vector
Day <- 1:(length(Infected))
# now specify initial values for S, I and R
init <- c(S = N - Infected[1], I = Infected[1], R = 0)
Then we need to define a function to calculate the \(RSS\), given a set of values for \(\beta\) and \(\gamma\).
# define a function to calculate the residual sum of squares (RSS), passing
# in parameters beta and gamma that are to be optimised for the best fit to
# the incidence data
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[, 3]
  sum((Infected - fit)^2)
}
Finally, we can fit the SIR model to our data by finding the values for \(\beta\) and \(\gamma\) that minimise the residual sum of squares between the observed cumulative incidence and the predicted cumulative incidence. We also need to check that our model has converged, as indicated by the message shown below:
# now find the values of beta and gamma that give the smallest RSS, which
# represents the best fit to the data. Start with values of 0.5 for each,
# and constrain them to the interval 0 to 1.0
Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1))
# check for convergence
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
Convergence is confirmed. Now we can examine the fitted values for \(\beta\) and \(\gamma\).
## beta gamma
## 0.5996020 0.4003982
Those values don’t mean a lot, per se, but let’s use them to get the fitted numbers of people in each compartment of our SIR model for the dates up to 14th April that were used to fit the model, and compare those fitted values with the observed data.
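Note that the code below refers to an Opt_par object that is not created in the chunks shown; it is presumably the named vector of fitted parameters extracted from Opt. A sketch of how it can be reconstructed, with the fitted values reported above as a standalone fallback:

```r
# reconstruct the named parameter vector that ode() expects as parms;
# if Opt (the optim() result) is not in scope, fall back to the fitted
# values reported above
if (!exists("Opt")) Opt <- list(par = c(0.5996020, 0.4003982))
Opt_par <- setNames(Opt$par, c("beta", "gamma"))
```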
# time in days for predictions; I will use a 60-day forecast horizon
t <- 1:as.integer(today() + 60 - ymd(sir_start_date))
# get the fitted values from our SIR model
fitted_cumulative_incidence <- data.frame(ode(y = init, times = t, func = SIR,
parms = Opt_par))
# add a Date column and join the observed incidence data
fitted_cumulative_incidence <- fitted_cumulative_incidence %>% mutate(Date = ymd(sir_start_date) +
days(t - 1), province = "Peru_cases_increase") %>% left_join(Peru_Cumulative_Incidence %>%
ungroup() %>% filter(province == "Peru_cases_increase") %>% select(Date,
cumulative_incident_cases))
# plot the data
f <- fitted_cumulative_incidence %>% filter(Date <= ymd("2020-04-14")) %>% ggplot(aes(x = Date)) +
geom_line(aes(y = I), colour = "red") + geom_point(aes(y = cumulative_incident_cases),
colour = "blue") + labs(y = "Cumulative incidence", title = "COVID-19 fitted vs observed cumulative incidence, Peru",
subtitle = "(red = fitted incidence from SIR model, blue = observed incidence)") +
theme_bw()
ggplotly(f)
That looks like a reasonably good fit to the observed cumulative incidence data, so we can now use our fitted model to calculate the basic reproduction number \(R_{0}\), which gives the average number of susceptible people who are infected by each infectious person:
\[R_{0} = \frac{\beta}{\gamma}\]
That’s very easy to calculate, and we get:
## R0
## 1.497514
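As a standalone check, a sketch of that calculation using the fitted parameter values reported above:

```r
# R0 = beta / gamma, using the fitted values printed earlier
beta_hat  <- 0.5996020
gamma_hat <- 0.4003982
R0 <- beta_hat / gamma_hat
round(R0, 6)  # ~ 1.497514
```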
An \(R_{0}\) of about 1.5 is consistent with the values calculated by others for COVID-19, and is also consistent with the \(R_{0}\) for SARS and MERS, which are similar diseases also caused by coronaviruses.
Using the SIR model for Peru to make predictions
An obvious next step is to use our fitted SIR model to make predictions about the future course of the outbreak. However, caution is required, because the SIR model assumes a fixed reproduction number. If public health interventions have been implemented, such as quarantining of cases, contact tracing and isolation of those contacts, and general restrictions on social mixing such as closing the country, then the effective reproduction number \(R_{e}\) will be dynamic and should fall, as those interventions are progressively implemented, to values considerably less than the basic reproduction number \(R_{0}\), which reflects the behaviour of the virus at the beginning of an epidemic, before any response has been implemented.
So let’s use our SIR model, fitted to the first 40 days of data, to extrapolate out to the current date, and compare that against the observed values:
We can see that the actual incidence is similar to that predicted by our model. The reason is that, even with the public health interventions implemented by the Peruvian authorities, the \(R_{e}\) of COVID-19 in Peru hasn’t been reduced by much. This is something that will need to be addressed ASAP.
When the \(R_{e}\) falls below 1.0, the peak of the epidemic will have been reached and the outbreak will eventually die out.
Using our model to let the outbreak “run its course” without intervention
It is instructive to use our model, fitted to the first 40 days of available data on lab-confirmed cases in Peru, to see what would happen if the outbreak were left to run its course, without public health interventions.
It is easier to see what is going on if we use a log scale:
Clearly that prediction, should it come to pass, would be an unmitigated disaster. At this point, it is worth remarking on the importance of decisive public health intervention to limit the spread of such epidemics. Without such interventions, tens of millions of people could be infected, as our model predicts, and even with only a one or two per cent mortality rate, hundreds of thousands of people would die. The economic and human cost of such outbreak control steps is large, but the alternatives are a lot worse!
Ascertainment rates
So far, we have assumed that the counts of lab-confirmed cases represent all the cases that are infectious. This is unlikely to be true – typically only a proportion of actual cases are detected or found or sent for testing. This proportion is known as the ascertainment rate. The ascertainment rate is likely to change during the course of an outbreak, particularly if surveillance and screening efforts are increased, or if case definitions are changed. Such changing ascertainment rates can be easily incorporated into our model by using a set of weights, or a weighting function, for our incidence data, but for the sake of simplicity, let’s see what happens if we assume a fixed ascertainment rate of 20%2. If we apply that, thus inflating the number of incident cases by a factor of 5, and refit our model, we get the following results.
# put the daily cumulative incidence numbers for Peru from 7th Mar to 14th
# Apr into a vector called Infected
sir_start_date <- "2020-03-07"
Infected <- Peru_Cumulative_Incidence %>% filter(province == "Peru_cases_increase",
Date >= ymd("2020-03-07"), Date <= ymd("2020-04-14")) %>% pull(cumulative_incident_cases)
# Apply a fixed 20% ascertainment rate
Infected <- Infected * 5
# Create an incrementing Day vector the same length as our cases vector
Day <- 1:(length(Infected))
# now specify initial values for S, I and R
init <- c(S = N - Infected[1], I = Infected[1], R = 0)
RSS <- function(parameters) {
  names(parameters) <- c("beta", "gamma")
  out <- ode(y = init, times = Day, func = SIR, parms = parameters)
  fit <- out[, 3]
  sum((Infected - fit)^2)
}
Opt <- optim(c(0.5, 0.5), RSS, method = "L-BFGS-B", lower = c(0, 0), upper = c(1, 1))
# check for convergence
Opt$message
## [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
## beta gamma
## 0.5997458 0.4002552
## R0
## 1.498409
Note that these fitted parameters are the same as the ones we got above, without an ascertainment rate adjustment. Let’s look at the fitted values.
Perhaps counter-intuitively, incorporating a fixed 20% ascertainment rate adjustment into our model makes little difference to the modelled outbreak if it is left to run its course, except that it all happens a bit more quickly. But the number of infections remains the same, and the basic reproduction number is unchanged. Note that this holds only for a fixed ascertainment rate. If the ascertainment rate varies significantly over time, then the parameter estimates will necessarily be biased – but in the early days of an outbreak, it may be reasonable to assume that ascertainment rates don’t change too much.
Epidemic trajectory model using log-linear models
As noted above, the initial exponential phase of an outbreak, when shown in a semi-log plot (the y-axis with a logarithmic transform), looks approximately linear. This suggests that we may be able to model epidemic growth and decay using a simple log-linear model:
\[log(y) = rt + b\]
where \(y\) is the incidence, \(r\) is the growth rate, \(t\) is the number of days since a specific point in time (typically the start of the outbreak), and \(b\) is the intercept. Separate models are fitted to the growth and the decay parts of the epidemic (incidence data) curve.
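The incidence package fits these phases internally, but the idea can be illustrated directly with lm() on simulated data (hypothetical values, using an assumed growth rate of 0.17 per day):

```r
# Illustrative sketch: during exponential growth, log(incidence) is
# linear in time, so a log-linear fit recovers the growth rate r
set.seed(1)
day <- 1:20
r_true <- 0.17                                    # assumed growth rate per day
y <- rpois(length(day), 20 * exp(r_true * day))   # simulated daily case counts
fit <- lm(log(y) ~ day)
r_hat <- unname(coef(fit)["day"])                 # estimated growth rate
doubling_time <- log(2) / r_hat                   # implied doubling time (days)
```

With more noise, or counts near zero, a Poisson GLM with a log link would be preferable to lm() on log counts.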
I proceed to find the peak of confirmed positive cases based on the currently available data.
Peru_incidence_peak <- find_peak(Peru_incidence_object)
plot(Peru_incidence_object) + geom_vline(xintercept = Peru_incidence_peak, col = "red",
    lty = 2) + labs(title = "Daily number of lab-confirmed positive cases in Peru",
    subtitle = "(red line indicates date of peak incidence)") + theme_bw()
Now I proceed to fit two log-linear models, one to the growth phase before the peak, and one to the decay phase after the peak. We can plot the fitted values from our models (with confidence limits) on top of the actual observed positive incidence data for Peru.
Peru_incidence_fit <- incidence::fit(Peru_incidence_object, split = Peru_incidence_peak)
# plot the incidence data and the model fit
plot(Peru_incidence_object) %>% add_incidence_fit(Peru_incidence_fit) + labs(title = "Daily number of lab-confirmed positive cases in Peru",
    subtitle = "Observed VS Modeled") + theme_bw()
From the model, we can extract various parameters of interest: the growth rate prior to the peak was 0.17 (95% CI 0.15 - 0.19), and a barely noticeable decay rate after the peak of -1.47.
This growth rate is equivalent to a doubling time of 4.1 days (95% CI 3.7 - 4.6 days).
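The doubling time follows directly from the growth rate: if incidence grows as \(e^{rt}\), it doubles when \(rt = \log 2\). A quick check with the estimated rate:

```r
# doubling time T satisfies exp(r * T) = 2, so T = log(2) / r
r <- 0.17                 # estimated growth rate per day (from above)
T_double <- log(2) / r
round(T_double, 1)        # ~ 4.1 days, matching the estimate above
```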
The doubling time estimate is quite informative for current public health intervention policy in Peru.
It is also possible to fit SIR and related models by MLE.↩︎
In a very interesting paper, Nishiura et al. state: “There were two interesting features of the evacuation process conducted by Japanese authorities. First, all 565 passengers [repatriated to Japan from Wuhan] were screened for symptoms upon arrival. Prior to disembarkation, the quarantine officers used portable thermoscanners to screen the temperature. In the following, all passengers were interviewed regarding whether they have any symptoms suggestive of upper respiratory tract infection, including fever and cough. Of the 565 passengers, 63 were symptomatic. Second, the passengers were tested for presence of 2019‐nCoV using reverse transcription polymerase chain reaction (RT‐PCR), and eight passengers (1.4%) were determined to have the virus. Importantly, most of the individuals positive for the virus were asymptomatic (five passengers), while only the other three had symptoms consistent with 2019‐nCoV infection, indicating that the dataset generated from this process can help investigate the full spectrum of infection including mild and asymptomatic infections.”. Thus, they found about 60% of those infected with the COVID-19 virus were essentially asymptomatic, and thus likely to be undetected and unascertained.↩︎